Abstract

HITIQA  is  an  interactive  question  answering 
technology  designed  to  allow  intelligence  analysts 
and  other  users  of  information  systems  to  pose 
questions  in  natural  language  and  obtain  relevant 
answers,  or  the  assistance  they  require  in  order  to 
perform  their  tasks.  Our  objective  in  HITIQA  is  to 
allow  the  user  to  submit  exploratory,  analytical, 
non-factual questions, such as “What has been
Russia’s reaction to U.S. bombing of Kosovo?”
The  distinguishing  property  of  such  questions  is 
that  one  cannot  generally  anticipate  what  might 
constitute  the  answer.  While  certain  types  of  things 
may  be  expected  (e.g.,  diplomatic  statements),  the 
answer  is  heavily  conditioned  by  what  information 
is  in  fact  available  on  the  topic.  From  a  practical 
viewpoint, analytical questions are often underspecified,
thus casting a broad net over a space of
possible  answers.  Therefore,  clarification  dialogue 
is  often  needed  to  negotiate  with  the  user  the  exact 
scope  and  intent  of  the  question. 

1  Introduction 

The HITIQA project is part of the ARDA AQUAINT
program, which aims to make significant advances in
the state of the art of automated question answering.
In this paper we focus on two aspects of our
work:

1. Question Semantics: how the system “understands”
user requests.

2. Human-Computer Dialogue: how the user
and the system negotiate this understanding.

We also discuss very preliminary evaluation
results from a series of pilot tests of the system
conducted by intelligence analysts via a remote
internet link.

2  Factual  vs.  Analytical 


The objective in HITIQA is to allow the user to
submit and obtain answers to exploratory, analytical,
non-factual questions. There are very significant
differences between factual, or fact-finding,
and analytical question answering. A factual question
seeks pieces of information that would make a
corresponding statement true (i.e., they become
facts): “How many states are in the U.S.?” / “There
are X states in the U.S.” In this sense, a factual
question usually has just one correct answer that
can generally be judged for its truthfulness. By
contrast, an analytical question is one where the “truth”
of the answer is more a matter of opinion and may
depend upon the context in which the question is
asked. Answers to analytical questions are rarely
unilateral; indeed, a mere “correct” answer may
have limited value, and in some cases may not
even be determinate (“Which college is the best?”,
“How do I stop my baby’s crying?”). Instead, answers
to analytical questions are often judged as
helpful, useful, or satisfactory. “Technically
correct” answers (e.g., “feed the baby milk”) may
be considered irrelevant or at best unresponsive.

The distinction between factual and analytical
questions depends primarily on the intention of the
person who is asking; however, the form of a question
is often indicative of which of the two classes
it is more likely to belong to. Factual questions
can be classified into a number of syntactic formats
(a “question typology”) that aid automatic processing.

Factual questions display a fairly distinctive
“answer type”, which is the type of the information
piece needed to fulfill the statement. Recent automated
systems for answering factual questions
deduce this expected answer type from the form of
the question and a finite list of possible answer
types. For example, “Who was the first man in
space?” expects a “person” as the answer, while
“How long was the Titanic?” expects some length
measure as an answer, probably in yards and feet,
or meters. This is generally a very good strategy
that has been exploited successfully in a number of
automated QA systems that appeared in recent
years, especially in the context of the TREC QA¹
evaluations (Harabagiu et al., 2000; Hovy et al.,
2000; Prager et al., 2001).

This process is not easily applied to analytical
questions, because the type of an answer to an
analytical question cannot always be anticipated,
owing to its inherently exploratory character. In
contrast to a factual question, an analytical question
has an unlimited variety of syntactic forms
with only a loose connection between their syntax
and the expected answer. Given the unlimited potential
for forming analytical questions, it
would be counter-productive to restrict them to a
limited number of question/answer types. Even
finding a non-strictly factual answer to an otherwise
simple question about the Titanic’s length (e.g.,
“two football fields”) would push the limits of the
answer-typing approach. Therefore, the formation
of an answer should instead be guided by the topics
the user is interested in, as recognized in the
query and/or through the interactive dialogue,
rather than by a single type inferred from the
query, as in a factual system.

This paper argues that the semantics of an analytical
question is more likely to be deduced from
the information that is considered relevant to the
question than through a detailed analysis of its
particular form. While this may sound circular, it
need not be. Determining “relevant” information
is not the same as finding an answer; indeed we
can use relatively simple information retrieval
methods (keyword matching, etc.) to obtain perhaps
50 or 100 “relevant” documents from a database.
This gives us an initial answer space to work
on in order to determine the scope and complexity
of the answer. In our project, we use structured
templates, which we call frames, to map out the
content of pre-retrieved documents, and subsequently
to delineate the possible meaning of the
question (Section 6).


¹ TREC QA is the annual Question Answering evaluation
sponsored by the U.S. National Institute of Standards
and Technology (www.trec.nist.gov).


3  Document  Retrieval 

When the user poses a question to a system sitting
atop a huge database of unstructured data (text
files), the first order of business is to reduce that
pile to perhaps a handful of documents where the
answer is likely to be found. This means, most often,
document retrieval, using fast but non-exact
selection methods. Questions are tokenized and
sent to a document retrieval engine, such as Smart
(Buckley, 1985) or InQuery (Callan et al., 1992).
Noun phrases and verb phrases are extracted from
the question to give us a list of potential topics that
the user may be interested in.

In the experiments with the HITIQA prototype
(see Figure 1), we retrieve the top fifty documents
from three gigabytes of newswire
(AQUAINT corpus plus web-harvested documents).
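
To make the retrieval step concrete, the following is a minimal sketch of keyword-overlap retrieval. It is an illustration only: it does not use Smart or InQuery, and the stopword list, the scoring rule, and the names `tokenize` and `retrieve` are assumptions introduced here rather than parts of HITIQA.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "has",
             "how", "what", "are", "this"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def retrieve(question, documents, top_k=50):
    """Rank documents by the number of question-term occurrences they contain.

    `documents` maps a document id to its text. Documents that share no
    term with the question are left out. Ties are broken by document id.
    """
    q_terms = set(tokenize(question))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in q_terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

The default `top_k=50` mirrors the fifty documents mentioned above; a production engine would use term weighting and indexing rather than a linear scan.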


Figure  1:  HITIQA  preliminary  architecture 


4  Data  Driven  Semantics  of  Questions 

The set of documents and text passages returned
from the initial search is not just a random subset
of the database. Depending upon the quality (recall
and precision) of the text retrieval system available,
this set can be considered as a first stab at
understanding the user’s question by the machine.
Again, given the available resources, this is the
best the system can do under the circumstances.
Therefore, we may as well consider this collection
of retrieved texts (the Retrieved Set) as the meaning
of the question as understood by the system.
This is a fair assessment: the better our search capabilities,
the closer this set would be to what the
user may accept as an answer to the question.

We can do better, however. We can perform
automatic analysis of the retrieved set, attempting
to uncover whether it is a fairly homogeneous bunch (i.e.,
all texts have very similar content), or whether
there are a number of diverse topics represented
there, somehow tied together by a common thread.
In the former case, we may be reasonably confident
that we have the answer, modulo the retrievable
information. In the latter case, we know that
the question is more complex than the user may
have intended, and a negotiation process is needed.

We can do better still. We can measure how
well each of the topical groups within the retrieved
set is “matching up” against the question. This is
accomplished through a framing process described
later in this paper. The outcome of the framing
process is twofold: firstly, the alternative interpretations
of the question are ranked within three broad
categories: on-target, near-misses and outliers.
Secondly, salient concepts and attributes for each
topical group are extracted into topic frames. This
enables the system to conduct a meaningful dialogue
with the user, a dialogue which is wholly
content oriented, and thus entirely data driven.


Figure 2: Answer Space Topology. The goal of interactive
QA is to optimize the ON-TARGET middle zone.

5  Clustering 

We use n-gram-based clustering of text passages
and concept extraction to uncover the main topics,
themes and entities in the retrieved set.


Retrieved documents are first broken into naturally
occurring paragraphs. Duplicate paragraphs
are filtered out and the remaining passages are
clustered using a combination of hierarchical clustering
and n-bin classification (details of the clustering
algorithm can be found in Hardy et al.,
2002a). Typically three to six clusters are generated
out of the top 50 documents, which may yield
as many as 1000 passages. Each cluster represents
a topic theme within the retrieved set: usually an
alternative or complementary interpretation of the
user’s question.
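
The sketch below illustrates the general idea of duplicate filtering followed by n-gram-based grouping. It is not the algorithm of Hardy et al. (2002a): it substitutes a simple greedy single-pass scheme using Jaccard overlap of word bigrams, and the threshold value and function names are assumptions made for the illustration.

```python
def ngrams(text, n=2):
    """Return the set of word n-grams in a passage."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_passages(passages, threshold=0.2, n=2):
    """Drop exact duplicates, then assign each passage to the first cluster
    whose seed passage is similar enough, or start a new cluster."""
    # dict.fromkeys keeps the first occurrence and preserves order.
    unique = list(dict.fromkeys(p.strip() for p in passages if p.strip()))
    clusters = []  # list of (seed n-grams, member passages)
    for passage in unique:
        grams = ngrams(passage, n)
        for seed_grams, members in clusters:
            if jaccard(grams, seed_grams) >= threshold:
                members.append(passage)
                break
        else:
            clusters.append((grams, [passage]))
    return [members for _, members in clusters]
```

Comparing only against the seed passage keeps the sketch short; a hierarchical method would compare and merge whole clusters instead.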

A list of topic labels is assigned to each cluster.
A topic label may come from one of two places.
First, the texts in the cluster are compared against
the list of key phrases extracted from the user’s
query. For each match found, the matching phrase
is used as a topic label for the cluster. Second, if a match
with the key phrases from the question cannot be
obtained, WordNet is consulted to see if a common
ancestor can be found. For example, “rifle” and
“machine gun” are kinds of “weaponry” in WordNet,
which allows an indirect match between a
question about weapon inspectors and a text reporting
a discovery by the authorities of a cache of
“rifles” and “machine guns”.
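
The two-step labeling just described can be sketched as follows. The taxonomy here is a hypothetical toy table standing in for WordNet, not WordNet's actual hierarchy, and the function names are assumptions made for the illustration.

```python
# Hypothetical toy taxonomy (term -> parent). A real system would consult
# WordNet; these links are illustrative only.
HYPERNYMS = {
    "rifle": "weaponry",
    "machine gun": "weaponry",
    "weapon": "weaponry",
    "weaponry": "instrumentality",
}


def ancestors(term):
    """Return the term followed by its chain of hypernyms."""
    chain = [term]
    seen = {term}
    while chain[-1] in HYPERNYMS and HYPERNYMS[chain[-1]] not in seen:
        parent = HYPERNYMS[chain[-1]]
        chain.append(parent)
        seen.add(parent)
    return chain


def common_ancestor(a, b):
    """Return the nearest node on a's chain that also lies on b's chain."""
    chain_b = set(ancestors(b))
    for node in ancestors(a):
        if node in chain_b:
            return node
    return None


def topic_labels(cluster_terms, query_phrases):
    """Prefer direct matches with query phrases; otherwise fall back to
    shared ancestors in the taxonomy."""
    direct = [p for p in query_phrases if p in cluster_terms]
    if direct:
        return direct
    labels = []
    for phrase in query_phrases:
        for term in cluster_terms:
            shared = common_ancestor(phrase, term)
            if shared and shared not in labels:
                labels.append(shared)
    return labels
```

With this toy table, a cluster mentioning "rifle" and "machine gun" and a query phrase "weapon" yields the indirect label "weaponry".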

6  Framing 

In HITIQA we use a text framing technique to delineate
the gap between the meaning of the user’s
question and the system’s “understanding” of this
question. The framing is an attempt to impose a
partial structure on the text that would allow the
system to systematically compare different text
pieces against each other and against the question,
and also to communicate with the user about this.
In particular, the framing process may uncover
topics and themes within the retrieved set which
the user has not explicitly asked for, and thus may
be unaware of. Nonetheless these
may carry important information: the NEAR-MISSES
in Figure 2.

In the current version of the system, frames are
fairly generic templates, consisting of a small
number of attributes, such as LOCATION, PERSON,
COUNTRY, ORGANIZATION, etc. Future versions of
HITIQA will add domain-specialized frames; for
example, we are currently constructing frames for
the Weapons Non-proliferation Domain. Most of
the frame attributes are defined in advance; however,
dynamic frame expansion is also possible.
Each of the attributes in a frame is equipped with
an extractor function which specializes in locating
and extracting instances of this attribute in the running
text. The extractors are implemented using
information extraction utilities which form the kernel
of Sheffield’s GATE² system. We have modified
GATE to separate organizations into companies
and other organizations, and we have also expanded
it by adding new concepts such as industries.
Therefore, the framing process strongly resembles
the template filling task in information extraction
(cf. MUC³ evaluations), with one significant exception:
while the MUC task was to fill in a template
using potentially any amount of source text
(Humphreys et al., 1998), the framing is essentially
an inverse process. In framing, potentially multiple
frames can be associated with a small chunk of text
(a passage or a short paragraph). Furthermore, this
chunk of text is part of a cluster of very similar text
chunks that further reinforce some of the most salient
features of these texts. This makes the frame
filling a significantly less error-prone task; our
experience has been far more positive than the
MUC evaluation results may indicate. This is because,
rather than trying to find the most appropriate
values for attributes from among many potential
candidates, we in essence fit the frames over
small passages⁴.

Therefore, data frames are built from the retrieved
data, after clustering it into several topical
groups. Since clusters are built out of small text
passages, we associate a frame with each passage
that serves as a seed of a cluster. We subsequently
merge passages, and their associated frames, whenever
anaphoric and other cohesive links are detected.
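
As a rough illustration of attribute extractors filling a frame from a single passage, consider the sketch below. It does not use GATE: the gazetteer lists, the regular-expression matching, and the names `make_extractor` and `build_frame` are hypothetical stand-ins chosen for the example.

```python
import re

# Hypothetical gazetteers standing in for GATE-based extractors.
GAZETTEERS = {
    "LOCATION": ["Black Sea", "Cuba"],
    "INDUSTRY": ["fishing", "tourism"],
}


def make_extractor(names):
    """Build an extractor that returns the listed names found in a text,
    in order of first appearance and without repeats."""
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    canonical = {n.lower(): n for n in names}

    def extract(text):
        found = []
        for match in pattern.finditer(text):
            value = canonical[match.group(0).lower()]
            if value not in found:
                found.append(value)
        return found

    return extract


# One extractor function per frame attribute.
EXTRACTORS = {attr: make_extractor(names) for attr, names in GAZETTEERS.items()}


def build_frame(passage):
    """Fit a generic frame over one passage by running every extractor."""
    frame = {attr: extract(passage) for attr, extract in EXTRACTORS.items()}
    frame["TEXT"] = passage
    return frame
```

Because each frame is fitted over one short passage, an extractor only has to report what it finds there, rather than choose among candidates drawn from a whole document.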

² GATE is the General Architecture for Text Engineering, an
information extraction system developed at the University of
Sheffield (Cunningham, 2000).

³ MUC, the Message Understanding Conference, funded by
ARPA, involved the evaluation of information extraction systems
applied to a common task.

⁴ We should note that selecting the right frame type for a passage
is an important pre-condition to “understanding”.


A very similar process is applied to the user’s
question, resulting in a Goal Frame which can be
subsequently compared to the data frames obtained
from retrieved data. For example, the Goal Frame
generated from the question, “How has pollution in
the Black Sea affected the fishing industry, and
what are the sources of this pollution?” is shown
in Figure 3 below.


TOPIC: [pollution, industry, sources]
LOCATION: [Black Sea]
INDUSTRY: [fishing]

Figure  3:  HITIQA  generated  Goal  Frame 


TOPIC: pollution
SUB-TOPIC: [sources]
LOCATION: [Black Sea]
INDUSTRY: [fisheries, tourism]
TEXT: [In a period of only three decades (1960's-1980's),
the Black Sea has suffered the catastrophic degradation
of a major part of its natural resources. Particularly acute
problems have arisen as a result of pollution (notably
from nutrients, fecal material, solid waste and oil), a
catastrophic decline in commercial fish stocks, a severe
decrease in tourism and an uncoordinated approach towards
coastal zone management. Increased loads of nutrients
from rivers and coastal sources caused an overproduction
of phytoplankton leading to extensive eutrophication
and often extremely low dissolved oxygen concentrations.
The entire ecosystem began to collapse. This problem,
coupled with pollution and irrational exploitation of
fish stocks, started a sharp decline in fisheries resources]
RELEVANCE: Matches on all elements found in Goal Frame

Figure  4:  A  HITIQA  generated  data  frame.  Words  in 
bold  were  used  to  fill  the  Frame. 

The data frames are then compared to the Goal
Frame. We pay particular attention to matching the
topic attributes, before any other attributes are considered.
If there is an exact match between a Goal
Frame topic and the text being used to build the
data frame, then this becomes the data frame’s
topic as well. If more than one match is found, the
subsequent matches become the sub-topics of the
data frame. On the other hand, if no match is possible
against the Goal Frame topic, we choose the
topic from the list of the WordNet-generated hypernyms.
An example data frame generated from
the text retrieved in response to the query about the
Black Sea is shown in Figure 4. After the initial
framing is done, frames judged to be related to the
same concept or event are merged together and
values of their attributes are combined.

7  Judging  Frame  Relevance 

We judge a particular data frame as relevant, and
subsequently the corresponding segment of text as
relevant, by comparison to the Goal Frame. The
data frames are scored based on the number of
conflicts found between them and the Goal Frame.
The conflicts are mismatches on values of corresponding
attributes. If a data frame is found to
have no conflicts, it is given the highest relevance
rank, and a conflict score of zero. All other data
frames are scored with an incrementing conflict
value: one for frames with one conflict with the
Goal Frame, two for two conflicts, etc. Frames that
conflict with all information found in the query are
given a score of 99, indicating the lowest relevance
rank. Currently, frames with a conflict score of 99
are excluded from further processing. The frame in
Figure 4 is scored as fully relevant to the question
(0 conflicts).
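
The scoring rule above can be sketched in a few lines. This is a simplified reading, not HITIQA's implementation: frames are modeled as dictionaries mapping attribute names to sets of values, a conflict is assumed to mean that the two frames share no value for an attribute the Goal Frame constrains, and matching is by exact string equality. The mapping of scores to categories assumes that near-misses are frames with one or two conflicts.

```python
EXCLUDED = 99  # score for frames that conflict with everything in the query


def count_conflicts(goal, data):
    """Count attributes on which the data frame shares no value with the
    Goal Frame. Only attributes with values in the Goal Frame are compared."""
    compared = [attr for attr, wanted in goal.items() if wanted]
    conflicts = sum(
        1 for attr in compared if not (goal[attr] & data.get(attr, set()))
    )
    if compared and conflicts == len(compared):
        return EXCLUDED
    return conflicts


def categorize(score):
    """Map a conflict score to one of the three answer-space zones
    (assumed cut-off: one or two conflicts count as a near-miss)."""
    if score == 0:
        return "on-target"
    if score in (1, 2):
        return "near-miss"
    return "outlier"
```

For example, against a Goal Frame with TOPIC, LOCATION and INDUSTRY values, a data frame that differs only in LOCATION scores 1 and is treated as a near-miss, while a frame sharing nothing with the Goal Frame scores 99.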

8  Enabling  Dialogue  with  the  User 

Framed information allows HITIQA to automatically
judge some text as relevant and to conduct a
meaningful dialogue with the user as needed on
other text. The purpose of the dialogue is to help
the user to navigate the answer space and to solicit
from the user more details as to what information
he or she is seeking. The main principle here is that
the dialogue is at the information semantic level,
not at the information organization level. Thus, it is
acceptable to ask the user whether information about the
AIDS conference in Cape Town should be included
in the answer to a question about combating
AIDS in Africa. However, the user should never be
asked if a particular keyword is useful or not, or if
a document is relevant or not. We have developed
a three-pronged strategy:

1.  Narrowing  dialogue:  ask  questions  that 
would  allow  the  system  to  reduce  the  size 
of  the  answer  set. 

2. Expanding dialogue: ask questions that
would allow the system to decide if the answer
set needs to be expanded by information
just outside of it (near-misses).

3.  Fact  seeking  dialogue:  allow  the  user  to 
ask  questions  seeking  additional  facts  and 
specific  examples,  or  similar  situations. 

Of the above, we have thus far implemented the
first two options as part of the preliminary clarification
dialogue. The clarification dialogue is where
the user and the system negotiate the task that
needs to be performed. We can call this a “triaging
stage”, as opposed to the actual problem solving
stage (point 3 above). In practice, these two stages
are not necessarily separated and may overlap
throughout the entire interaction. Nonetheless,
the two have a decidedly distinct character and
require different dialogue strategies on the part of
the system.

Our approach to dialogue in HITIQA is modeled
to some degree upon the mixed-initiative dialogue
management adopted in the AMITIES project
(Hardy et al., 2002b). The main advantage of
the AMITIES model is its reliance on data-driven
semantics, which allows for spontaneous and mixed-initiative
dialogue to occur.

By contrast, the major approaches to implementation
of dialogue systems to date rely on systems
of functional transitions that make the resulting
system much less flexible. In the grammar-based
approach, which is prevalent in commercial systems,
such as various telephony products, as
well as in practically oriented research prototypes⁵
(e.g., DARPA, 2002; Seneff and Polifroni, 2000;
Ferguson and Allen, 1998), a complete dialogue
transition graph is designed to guide the conversation
and predict user responses, which is suitable
for closed domains only. In the statistical variation
of this approach, a transition graph is derived from
a large body of annotated conversations (e.g.,
Walker, 2000; Litman and Pan, 2002). This latter
approach is facilitated through a dialogue annotation
process, e.g., using Dialogue Act Markup in
Several Layers (DAMSL) (Allen and Core, 1997),
which is a system of functional dialogue acts.

Nonetheless, an efficient, spontaneous dialogue
cannot be designed on a purely functional layer.
Therefore, here we are primarily interested in the
semantic layer, that is, the information exchange
and information building effects of a conversation.
In order to properly understand a dialogue, both
semantic and functional layers need to be considered.
In this paper we are concentrating exclusively
on the semantic layer.

9  Clarification  Dialogue 

Data frames with a conflict score of zero form the
initial kernel answer space. Depending upon the
size of this set and the presence of other frames
outside of it, the system either proceeds to generate
the answer or initiates a dialogue with the user. For
example, if the answer space appears too large or
varied, e.g., consists of many different topics, the
system may ask the user how to narrow it. Alternatively,
the presence of large groups of text passages
with near-miss frames assigned to them (i.e.,
frames with 1 or 2 conflicts with the Goal Frame)
may indicate that the answer space is actually larger,
and the user will be consulted about a possible
broadening of the question. Currently, we only
initiate a clarification dialogue for 1-conflict
frames.

⁵ A notable exception is CU Communicator, developed
at the University of Colorado (Ward and Pellom, 1999).

A 1-conflict frame has only a single attribute
mismatch with the Goal Frame. This could be a
mismatch on any attribute, for example, LOCATION,
ORGANIZATION, or TIME. A special
case arises when the conflict occurs on the TOPIC
attribute. Since all other attributes match, we may
be looking at potentially different events or situations
involving the same entities, or occurring at
the same location or time. The purpose of the clarification
dialogue in this case is to probe which of
these topics may be of interest to the user. This is
illustrated in the exchange below, recorded during
an evaluation session with an intelligence analyst:

User: “Who is Elizardo Sanchez?”

HITIQA: “Are you interested in seeing information
about civil rights as it is related to Elizardo Sanchez?”


One-Conflict  Frame 

TOPIC: civil rights
SUB-TOPIC: []
LOCATION: [Cuba, Cuba, Cuba]
COMPANIES: []
PEOPLE: [Sanchez, Sanchez]
DOCUMENT DATE: [2000, 1, 11]
SOURCE: HAVANA (AP)
TEXT: [“I consider that the situation for civil and political
rights in Cuba has worsened over the past year... owing to
that Cuba continues to be the only closed society in this hemisphere,”
Sanchez said. “There have been no significant release
of prisoners, the number of people sanctioned or processed
for political motives increased.” Sanchez, who himself
spent many years in Cuban prisons, is among the communist
island's best known opposition activists. The commission he
heads issues a report on civil rights every six months, along
with a list of people it considers to be imprisoned for political
motives.]

Figure 5: One of the Frames that were used in generating
the Sanchez dialogue. Words in bold were used to fill
the Frame.

In order to understand what happened here, we
need to note first that the Goal Frame for the user
question does not have any specific value assigned
to its TOPIC attribute. This of course is as we would
expect it: the question does not give us a hint as to
what information we need to look for or may be
hoping to find about Sanchez. This also means that
all the text frames obtained from the retrieved set
for this question will have at least one conflict,
making them near-misses at best. One such text frame
is shown in Figure 5: its topic is “civil rights” and
it is about Sanchez. HITIQA thus asks if “civil rights” is a topic
of interest to the user. If the user responds positively,
this topic will be added to the answer space.

The above dialogue strategy is applicable to
other attribute mismatch cases, and produces intelligent-sounding
responses from the system. During
the dialogue, as new information is obtained from
the user, the Goal Frame is updated and the scores
of all the data frames are reevaluated. The system
may interpret the new information as a positive or
a negative. Positives are added to the Goal Frame.
Negatives are stored in a Negative-Goal Frame and
will also be used in the re-scoring of the data
frames, possibly causing conflict scores to increase.
The Negative-Goal Frame is created when
HITIQA receives a negative response from the
user. The Negative-Goal Frame includes information
that HITIQA has identified as being of no interest
to the user. If the user responds with the equivalent
of “yes” to the system clarification question in
the Sanchez dialogue, civil rights will be added to
the topic list in the Goal Frame and all one-conflict
frames with a civil rights topic will be re-scored to
zero conflicts; two-conflict frames with
civil rights as a topic will be re-scored to one, etc.
If the user responds “no”, the Negative-Goal
Frame will be generated and all frames with
civil rights as a topic will be re-scored to 99 in order
to remove them from further processing.
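
The update-and-rescore cycle described above can be sketched as follows. This is an illustrative reading of the Sanchez example rather than HITIQA's code: frames are dictionaries of value sets, an empty TOPIC in the Goal Frame is assumed to count as a conflict with any data frame topic (which is how the one-conflict frames in the example arise), and the function names are hypothetical.

```python
EXCLUDED = 99  # frames with this score are removed from further processing


def score(goal, negative, frame):
    """Conflict score of a data frame given the Goal Frame and the
    Negative-Goal Frame."""
    # Any value the user has rejected excludes the frame outright.
    for attr, rejected in negative.items():
        if rejected & frame.get(attr, set()):
            return EXCLUDED
    conflicts = 0
    for attr, wanted in goal.items():
        if attr != "TOPIC" and not wanted:
            continue  # unconstrained attribute: nothing to conflict with
        if not (wanted & frame.get(attr, set())):
            conflicts += 1
    return conflicts


def apply_answer(goal, negative, attr, value, accepted):
    """Record the user's reply: positives extend the Goal Frame,
    negatives extend the Negative-Goal Frame."""
    target = goal if accepted else negative
    target.setdefault(attr, set()).add(value)


def rescore(goal, negative, frames):
    """Re-evaluate every data frame after the frames have been updated."""
    return [score(goal, negative, f) for f in frames]
```

Starting from a Goal Frame with an empty TOPIC and PEOPLE set to Sanchez, a "yes" to the civil rights question lowers the score of each civil rights frame by one, while a "no" sends all of them to 99.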

The clarification dialogue will continue on the
topic level until all the significant sets of NEAR-MISS
frames are either included in the answer
space (through the user broadening the scope of the
question, which removes the initial conflicts) or dismissed
as not relevant. When HITIQA reaches this
point it will re-evaluate the data frames in its answer
space. If there are too many answer frames
now (more than a pre-determined upper threshold),
the dialogue manager will offer the user the option to narrow
the question using another frame attribute. If
the size of the new answer space is still too small
(i.e., there are many unresolved near-miss frames),
the dialogue manager will suggest to the user ways
of further broadening the question, thus making
more data frames relevant, or possibly retrieving
new documents by adding terms acquired through
the clarification dialogue. When the number of
frames is within the acceptable range, HITIQA will
generate the answer using the text from the frames
in the current answer space. The user may end the
dialogue at any point and have an answer generated
given the current state of the frames.
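
The decision made by the dialogue manager at this point can be summarized in a small sketch. The threshold values and the function name are assumptions chosen for the illustration; the text states only that a pre-determined upper threshold and an acceptable range exist.

```python
def next_action(scores, min_frames=3, max_frames=25):
    """Choose the next dialogue move from the current conflict scores.

    `scores` holds one conflict score per data frame. The thresholds are
    hypothetical placeholders, not values taken from HITIQA.
    """
    on_target = scores.count(0)    # frames already in the answer space
    near_misses = scores.count(1)  # one-conflict frames
    if on_target > max_frames:
        return "narrow"
    if on_target < min_frames and near_misses > 0:
        return "broaden"
    return "answer"
```

When the answer space is small but no one-conflict frames remain, the sketch simply generates the answer, since there is nothing left to broaden with.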

9.1  Narrowing  Dialogue 

HITIQA attempts to reduce the number of frames
judged to be relevant through a Narrowing Dialogue.
This is done when the answer space contains
too many elements to form a succinct answer.
This typically happens when the initial question
turns out to be too vague or unspecific with respect
to the available data.

9.2  Broadening  Dialogue 

As explained before, the system may attempt to
increase the number of frames judged relevant
through a Broadening Dialogue (BD) whenever
the answer space appears too narrow, i.e., contains
too few zero-conflict frames. We are conducting
further experiments to define this situation more
precisely. Currently, the BD will only occur if
there are one-conflict frames, i.e., near misses.
Broadening questions can be asked about any of
the attributes which have values in the Goal Frame.

10  Answer  Generation 

Currently, the answer is simply composed of text
passages from the zero-conflict frames. The texts of
these frames are ordered by date and output to
the user. Typically the answer to these analytical
type questions will require many pages of information.
Example 1 below shows the first portion of
the answer generated by HITIQA for the Black Sea
query. Current work is focusing on answer generation.

2002: 

The Black Sea is widely recognized as one of the regional
seas most damaged by human activity. Almost
one third of the entire land area of continental Europe
drains into this sea... major European rivers, the Danube,
Dnieper and Don, discharge into this sea while its
only connection to the world's oceans is the narrow
Bosphorus Strait. The Bosphorus is as little as 70 meters
deep and 700 meters wide but the depth of the
Black Sea itself exceeds two kilometers in places. Contaminants
and nutrients enter the Black Sea via river
run-off mainly and by direct discharge from land-based
sources. The management of the Black Sea itself is the
shared responsibility of the six coastal countries: Bulgaria,
Georgia, Romania, Russian Federation, Turkey,
and Ukraine...

Example 1: Partial answer generated by HITIQA to the
Black Sea query.
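The current answer generation step, as described at the start of this section, can be sketched as follows. This is a minimal illustration, not the HITIQA implementation: the `Frame` structure and the function name `generate_answer` are hypothetical, and the per-year heading is an assumption inferred from the "2002:" line in Example 1.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass
class Frame:
    """Illustrative stand-in for a text frame in the answer space."""
    text: str
    date: date
    conflicts: int  # number of conflicts with the Goal Frame


def generate_answer(frames):
    """Concatenate the text of zero-conflict frames in date order."""
    selected = sorted((f for f in frames if f.conflicts == 0),
                      key=lambda f: f.date)
    parts = []
    # Assumption: passages are grouped under a heading for their year.
    for year, group in groupby(selected, key=lambda f: f.date.year):
        parts.append(f"{year}:")
        parts.extend(f.text for f in group)
    return "\n\n".join(parts)
```

Frames with one or more conflicts are simply left out, so the output changes as the dialogue moves frames into or out of the zero-conflict set.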

11  Evaluations 

We have just completed the first round of a pilot
evaluation for testing the interactive dialogue
component of HITIQA. The purpose of this first stage
of evaluation is to determine what kind of dialogue
is acceptable/tolerable to the user and whether an
efficient navigation through the answer space is
possible. HITIQA was blindly tested by two different
analysts on eleven different topics. Five different
groups participated, but no analyst tested more than
one system, as system comparison was not a goal. The
analysts were given complete freedom in forming
their queries and responses to HITIQA's questions.
They were only provided with descriptions of the
eleven topics the systems would be tested on. The
analysts were given 15 minutes for each topic to
arrive at what they believed to be an acceptable
answer. During testing, a Wizard (human) was allowed
to intervene if HITIQA generated a dialogue
question/response that was felt to be inappropriate.
The Wizard was able to override the system and send
a Wizard-generated question/response to the analyst.
The HITIQA Wizard intervened an average of 13% of
the time.

These results are for informational purposes only,
as this was not a formal evaluation. HITIQA earned
an average score of 5.8 from both Analysts for
dialogue, where 1 was “extremely dissatisfied” and 7
was “completely satisfied”. The highest score
possible was a 7 for each dialogue. The Analysts
were asked to grade each scenario for success or
failure. We divide the failures from both Analysts
into three categories:

1) the user gave up on the system for the
given scenario (9%)

2) the 15-minute time limit expired (13%)

3) the data was not in the database (9%)

HITIQA had a 63% success rate for Analyst 1 and
a 73% success rate for Analyst 2. It is unclear how
these results should be interpreted, if at all, as the
evaluation was a mere pilot, mostly to test the
mechanics of the setup. We know only that a human
Wizard equipped with all necessary information
can easily achieve 100% success in this test. What
is still needed is a baseline performance, perhaps
based on using an ordinary keyword-based search
engine.

12  Future  Work 

This paper describes a work in progress. We expect
that the initial specification of the content frame
will evolve as we subject the initial system to more
demanding evaluations. Currently, the frames are
not topically specialized, and topical specialization
appears to be the most logical next refinement, i.e.,
developing several (10-30) types of frames covering
different classes of events, from politics to
medicine to science to international economics, etc.
This is expected to increase the accuracy of the
dialogue, as is the interactive visualization, which
is also under development. Answer generation will
involve fusion of information on the frame level,
and is currently in an initial phase of
implementation.